Introduction to Cubic Spline Regression

A cubic regression spline is a form of generalized linear model used in regression analysis. Also known as a B-spline, it is built from a series of basis functions supported on intervals between chosen knots. Cubic regression splines are widely used to model nonlinear data and interactions between variables. Unlike traditional methods such as polynomial regression or broken-stick regression, cubic regression splines take both smoothness and local influence into consideration (Faraway, 2015). According to Julian J. Faraway, each basis function must be a continuous cubic polynomial that is nonzero on an interval defined by the knots and zero everywhere else, which ensures local influence. Smoothness requires that the first and second derivatives be continuous at each knot (Faraway, 2015). More details on cubic regression splines can be found at link
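These two properties can be checked numerically. Below is a minimal sketch using scipy; the knots 0, 10, 20, 30, 40 are illustrative choices for the demonstration, not values from the uswages data. A single cubic B-spline basis function vanishes outside its knot span (local influence), and its first derivative matches from both sides of an interior knot (smoothness):

```python
import numpy as np
from scipy.interpolate import BSpline

# A cubic (degree-3) B-spline basis function is defined by 5 consecutive knots;
# these knots are illustrative, not taken from the uswages data.
knots = [0, 10, 20, 30, 40]
b = BSpline.basis_element(knots, extrapolate=False)

# Local influence: nonzero strictly inside the span, zero at its end knots
print(b(20.0))            # positive at the center of the span
print(b(0.0), b(40.0))    # zero at both ends

# Smoothness: first derivative agrees when approached from either side of a knot
db = b.derivative()
print(db(10.0 - 1e-8), db(10.0 + 1e-8))
```

The same continuity check applies to the second derivative, which is why a cubic spline looks visually smooth across its knots.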

Data Summary

In this project, we’ll be using the data set uswages from the R package faraway. The data set can be found at link The descriptions of the variables are as follows:

kable(var_explain, caption = "Variable description")
Variable description
Variable  Type     Explanation
wage      double   Real weekly wages in dollars
educ      integer  Years of education
exper     integer  Years of experience
race      integer  1 if Black, 0 if White (other races not in sample)
smsa      integer  1 if living in a Standard Metropolitan Statistical Area, 0 if not
ne        integer  1 if living in the Northeast
mw        integer  1 if living in the Midwest
so        integer  1 if living in the South
we        integer  1 if living in the West
pt        integer  1 if working part time, 0 if not

In this tutorial, we will focus on the relationship between the response variable wage and the predictor variable exper. This two-dimensional relationship is easy to present and allows us to better illustrate the cubic regression spline method.

Applying Cubic Regression Spline with R

Data Cleaning

Regression spline methods are easy to apply with the R package splines. To load the data, manipulate it, and draw plots, the packages faraway, dplyr, and ggplot2 are also needed.

library(splines)
library(faraway)
library(ggplot2)
library(dplyr)
data(uswages)
ggplot(uswages, aes(exper, wage))+geom_point(col = "slategrey") + ggtitle("Relationship between weekly wages and years of experience")

From the plot, we can observe a few outliers with extremely high wages. To limit the influence of outliers on the regression models, we remove observations with wage larger than 4000.

uswages = filter(uswages, wage<4000)

First, let’s try to capture the relationship with an ordinary least squares (OLS) model.

fit1 = lm(wage~exper, data = uswages)
plot(uswages$exper, uswages$wage, xlab = "Experience", 
     ylab = "Weekly wage", main = "OLS model", col = "slategrey")
abline(fit1, col = "red")

From the plot, we can see that the OLS model fails to capture most of the variance in the data.

Polynomial regression is a good alternative in this case. Linear models with polynomials of degree 2 and degree 4 are fitted as follows:

g2 = lm(wage~poly(exper, 2), data = uswages)
g4 = lm(wage~poly(exper, 4), data = uswages)
uswages = mutate(uswages, degree2 = fitted(g2), degree4 = fitted(g4))
ggplot(uswages, aes(exper, wage)) +
  geom_point( col = "slategrey") + 
  geom_line(aes(exper, degree2,color = "2"))+
  geom_line(aes(exper, degree4,color = "4")) +
  scale_color_manual(values = c(
    '2' = 'darkblue',
    '4' = 'red')) +
  labs(color = 'Polynomial degree')+
  ggtitle("Polynomial regression models")

Finally, we fit a cubic regression spline with bs() from the splines package, placing knots at 0, 20, 40, and 58:

cubic_spline = lm(wage~bs(exper, knots = c(0, 20, 40, 58)), data = uswages)
uswages = mutate(uswages, smooth = fitted(cubic_spline))
ggplot(uswages, aes(exper, wage)) + 
  geom_point(col = "slategrey") +
  geom_line(aes(exper, smooth), col = "red") + 
  ggtitle("Cubic regression spline model")

Applying Cubic Regression Spline with Python

  1. Loading packages. In this Python tutorial, pandas is required for data manipulation, statsmodels is required for building regression models, and matplotlib is required for plotting.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from math import sqrt
from sklearn.metrics import mean_squared_error
from patsy import dmatrix
from scipy import interpolate
  2. Import dataset using pandas
# read dataset
data = pd.read_csv("uswages.csv")
print(data.head())
##    Unnamed: 0    wage  educ  exper  race  smsa  ne  mw  so  we  pt
## 0        6085  771.60    18     18     0     1   1   0   0   0   0
## 1       23701  617.28    15     20     0     1   0   0   0   1   0
## 2       16208  957.83    16      9     0     1   0   0   1   0   0
## 3        2720  617.28    12     24     0     1   1   0   0   0   0
## 4        9723  902.18    14     12     0     1   0   1   0   0   0
  3. Data exploration
print(data.describe())
##          Unnamed: 0         wage     ...               we           pt
## count   2000.000000  2000.000000     ...       2000.00000  2000.000000
## mean   14015.171000   608.117865     ...          0.21000     0.092500
## std     8099.947724   459.832629     ...          0.40741     0.289803
## min        7.000000    50.390000     ...          0.00000     0.000000
## 25%     7028.750000   308.640000     ...          0.00000     0.000000
## 50%    13946.500000   522.320000     ...          0.00000     0.000000
## 75%    21117.250000   783.480000     ...          0.00000     0.000000
## max    28140.000000  7716.050000     ...          1.00000     1.000000
## 
## [8 rows x 11 columns]
print(data.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2000 entries, 0 to 1999
## Data columns (total 11 columns):
## Unnamed: 0    2000 non-null int64
## wage          2000 non-null float64
## educ          2000 non-null int64
## exper         2000 non-null int64
## race          2000 non-null int64
## smsa          2000 non-null int64
## ne            2000 non-null int64
## mw            2000 non-null int64
## so            2000 non-null int64
## we            2000 non-null int64
## pt            2000 non-null int64
## dtypes: float64(1), int64(10)
## memory usage: 172.0 KB
## None

We can see that wage and experience have a non-linear relationship.

## exper vs wage ##
data_x = data[['exper']]
data_y = data[['wage']]
#visualize the relationship between experience and wage
plt.scatter(data_x, data_y, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.suptitle('Fig 1. Relationship between experience and wage', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

Looking at Fig 1., there seems to be an outlier (a wage above 7000). In this tutorial, we remove that outlier and continue building our models.

# remove outlier
data_ylim = data.loc[data['wage']<= 7000]
wage = data_ylim[['wage']]
exper_x = data_ylim[['exper']]
#visualize the relationship between experience and wage
plt.clf()
plt.scatter(exper_x, wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.suptitle('Fig 2. Relationship between experience and wage (remove outlier)', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

  4. Simple Linear Regression. First we add an intercept to our simple linear regression model, then use sm.OLS from statsmodels to do the analysis.
#add an intercept (beta_0) to our model
exper_x = sm.add_constant(exper_x)  
# model fitting
model = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions1 = model.predict(exper_x) 
print(model.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                   wage   R-squared:                       0.029
## Model:                            OLS   Adj. R-squared:                  0.029
## Method:                 Least Squares   F-statistic:                     59.88
## Date:                Tue, 27 Nov 2018   Prob (F-statistic):           1.59e-14
## Time:                        10:16:16   Log-Likelihood:                -14935.
## No. Observations:                1999   AIC:                         2.987e+04
## Df Residuals:                    1997   BIC:                         2.989e+04
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const        503.1140     16.198     31.060      0.000     471.347     534.881
## exper          5.5164      0.713      7.738      0.000       4.118       6.915
## ==============================================================================
## Omnibus:                     1156.252   Durbin-Watson:                   1.957
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):            16087.929
## Skew:                           2.445   Prob(JB):                         0.00
## Kurtosis:                      16.009   Cond. No.                         38.7
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Looking at Fig 3., we clearly can’t use simple linear regression to explain the relationship between wage and experience, since the association between them is non-linear.

# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(exper_x['exper'], predictions1, color = 'green', linewidth = 1.5)
plt.suptitle('Fig 3. Relationship between experience and wage: using simple linear regression', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

# Calculating RMSE value
rms1 = sqrt(mean_squared_error(wage, predictions1))
print(rms1)
## 425.13535436123
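The RMSE computed above via sklearn's mean_squared_error plus sqrt is just the square root of the average squared residual. A minimal sketch with made-up numbers (not the uswages data) to make the formula concrete:

```python
import numpy as np

# RMSE by hand: root of the mean squared residual
y = np.array([100.0, 200.0, 300.0])       # illustrative observed values
y_hat = np.array([110.0, 190.0, 310.0])   # illustrative fitted values
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(rmse)  # 10.0
```

Because RMSE is on the same scale as the response, the values below can be read directly as a typical weekly-wage error in dollars.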
  5. Polynomial Regression. Next we will try using polynomial regression to assess the relationship by setting “experience” to degree = 2. Again, we will use sm.OLS from statsmodels to do this analysis.
# refit model using polynomial regression ("exper" with degree = 2)
exper_x['exper2'] = np.power(exper_x['exper'], 2)
# model fitting
model2 = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions2 = model2.predict(exper_x) 
print(model2.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                   wage   R-squared:                       0.133
## Model:                            OLS   Adj. R-squared:                  0.132
## Method:                 Least Squares   F-statistic:                     152.9
## Date:                Tue, 27 Nov 2018   Prob (F-statistic):           1.65e-62
## Time:                        10:16:16   Log-Likelihood:                -14822.
## No. Observations:                1999   AIC:                         2.965e+04
## Df Residuals:                    1996   BIC:                         2.967e+04
## Df Model:                           2                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const        268.2870     21.573     12.436      0.000     225.979     310.595
## exper         38.8749      2.262     17.190      0.000      34.440      43.310
## exper2        -0.7334      0.047    -15.453      0.000      -0.826      -0.640
## ==============================================================================
## Omnibus:                     1214.317   Durbin-Watson:                   1.969
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):            19762.288
## Skew:                           2.557   Prob(JB):                         0.00
## Kurtosis:                      17.530   Cond. No.                     1.97e+03
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.97e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.

The polynomial curve in Fig 4. captures the non-linear relationship between wage and experience.

# generate 100 evenly spaced values of experience for plotting
x_lim = np.linspace(start = exper_x['exper'].min(), stop = exper_x['exper'].max(), num = 100)
x_lim_df = pd.DataFrame({'exper':x_lim})
x_lim_df['exper2'] = np.power(x_lim_df['exper'], 2)
x_lim_df = sm.add_constant(x_lim_df) 
# find fitted value using x_lim
fit_reduce = model2.predict(x_lim_df)
# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(x_lim_df[['exper']], fit_reduce, color = 'blue', linewidth = 1.5, label='experience with degree = 2')
plt.legend()
plt.suptitle('Fig 4. Relationship between experience and wage: using polynomial regression', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

# Calculating RMSE value
rms2 = sqrt(mean_squared_error(wage, predictions2))
print(rms2)
## 401.78144699617144
  6. Cubic Regression Spline. We select 0, 20, 40, 57 as our knots, and use sm.GLM from statsmodels to build the cubic regression spline model.
# cubic spline with 4 knots at 0, 20, 40, 57
cubic_x = dmatrix("bs(data, knots = (0, 20, 40, 57), include_intercept = False)", {"data": exper_x[['exper']]}, return_type = 'dataframe')
# model fitting
model3 = sm.GLM(wage, cubic_x).fit()
# find fitted value
predictions3 = model3.predict(cubic_x)
print(model3.summary())
##                  Generalized Linear Model Regression Results                  
## ==============================================================================
## Dep. Variable:                   wage   No. Observations:                 1999
## Model:                            GLM   Df Residuals:                     1991
## Model Family:                Gaussian   Df Model:                            7
## Link Function:               identity   Scale:                      1.6183e+05
## Method:                          IRLS   Log-Likelihood:                -14821.
## Date:                Tue, 27 Nov 2018   Deviance:                   3.2220e+08
## Time:                        10:16:16   Pearson chi2:                 3.22e+08
## No. Iterations:                     3   Covariance Type:             nonrobust
## ===============================================================================================================================
##                                                                   coef    std err          z      P>|z|      [0.025      0.975]
## -------------------------------------------------------------------------------------------------------------------------------
## Intercept                                                     193.5365    261.141      0.741      0.459    -318.290     705.363
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[0]    12.3372    274.287      0.045      0.964    -525.255     549.929
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[1]   302.5933    259.491      1.166      0.244    -205.999     811.186
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[2]   709.6090    272.728      2.602      0.009     175.071    1244.147
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[3]   482.8562    267.447      1.805      0.071     -41.330    1007.042
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[4]   193.8713    284.069      0.682      0.495    -362.894     750.637
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[5]    31.3605    323.113      0.097      0.923    -601.930     664.651
## bs(data, knots=(0, 20, 40, 57), include_intercept=False)[6]   -28.7265    479.608     -0.060      0.952    -968.741     911.288
## ===============================================================================================================================
# generate 100 evenly spaced values of experience for plotting
x_lim = np.linspace(exper_x[['exper']].min(), exper_x[['exper']].max(), 100)
# find fitted value using x_lim
fit_reduce2 = model3.predict(dmatrix("bs(train, knots = (0, 20, 40, 57), include_intercept = False)", {"train": x_lim}, return_type = 'dataframe'))

The spline curve in Fig 5. also captures the non-linear relationship between wage and experience.

# plot spline
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(x_lim, fit_reduce2, color='r', label='Specifying 4 knots, knots = (0, 20, 40, 57)')
plt.legend()
plt.suptitle('Fig 5. Relationship between experience and wage: using cubic regression', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

# Calculating RMSE value
rms3 = sqrt(mean_squared_error(wage, predictions3))
print(rms3)
## 401.4743219404939
  7. Summary. Looking at Fig 6, we can see that the polynomial curve and the spline curve largely overlap, and their RMSEs are similar, with the spline’s slightly smaller.
# overlay three regression curve
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(exper_x['exper'], predictions1, color = 'green', linewidth = 1.5, label = 'Simple Linear Regression')
plt.plot(x_lim_df['exper'], fit_reduce, color = 'blue', linewidth = 1.5, label='Polynomial Regression, experience degree = 2')
plt.plot(x_lim, fit_reduce2, color='r', linewidth = 1.5, label='Cubic Regression Spline, knots = (0, 20, 40, 57)')
plt.legend()
plt.suptitle('Fig 6. Relationship between experience and wage', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

# compare RMSE
model = ['SLR', 'Polynomial', 'Spline']
RMSE = [rms1, rms2, rms3]
compare = pd.DataFrame({'Model':model, 'RMSE':RMSE})
print(compare)
##         Model        RMSE
## 0         SLR  425.135354
## 1  Polynomial  401.781447
## 2      Spline  401.474322
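The knots above were chosen by inspecting the scatter plot. A common alternative, not used in this tutorial, is to place interior knots at quantiles of the predictor so each segment contains roughly the same number of observations. A sketch with simulated experience values (the variable names and simulated range here are illustrative, not the uswages data):

```python
import numpy as np

# Simulated stand-in for the exper column; the real values would come from uswages
rng = np.random.default_rng(0)
exper = rng.integers(0, 60, size=2000)

# Interior knots at the quartiles of the predictor
knots = np.percentile(exper, [25, 50, 75])
print(knots)
```

The resulting knots could be passed to bs() in R or to patsy's bs() in the dmatrix call in place of the hand-picked values.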

Applying Cubic Regression Spline with STATA

Reference

Faraway, J. J. (2015). Linear Models with R (2nd ed.). Boca Raton, FL: Taylor & Francis.